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Background: The development of high-throughput omics technologies enabled genome-wide measurements of 
the activity of cellular elements and provides the analytical resources for the progress of the Systems Biology 
discipline. Analysis and interpretation of gene expression data has evolved from the gene to the pathway and 
interaction level, i.e. from the detection of differentially expressed genes, to the establishment of gene interaction 
networks and the identification of enriched functional categories. Still, the understanding of biological systems 
requires a further level of analysis that addresses the characterization of the interaction between functional modules. 

Results: We present a novel computational methodology to study the functional interconnections among the 
molecular elements of a biological system. The PANA approach uses high-throughput genomics measurements 
and a functional annotation scheme to extract an activity profile from each functional block -or pathway- followed 
by machine-learning methods to infer the relationships between these functional profiles. The result is a global, 
interconnected network of pathways that represents the functional cross-talk within the molecular system. We 
have applied this approach to describe the functional transcriptional connections during the yeast cell cycle and to 
identify pathways that change their connectivity in a disease condition using an Alzheimer example. 

Conclusions: PANA is a useful tool to deepen in our understanding of the functional interdependences that 
operate within complex biological systems. We show the approach is algorithmically consistent and the inferred 
network is well supported by the available functional data. The method allows the dissection of the molecular 
basis of the functional connections and we describe the different regulatory mechanisms that explain the 
network's topology obtained for the yeast cell cycle data. 



Introduction 

The analysis of genome-wide transcriptomics data has 
changed in the last decade from a gene-centric vision, 
which evaluated thousands of gene expression changes in 
parallel, to a systems biology orientation where coordina- 
tion among gene activities is pivotal. In light of this, data 
is analyzed from the perspective that genes do not act as 
independent entities, but as groups of cooperating mole- 
cules that define the cellular state [1,2]. Functional 
Enrichment [3] and Gene Set Enrichment Analysis 
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(GSEA) [4], collectively denoted here as Enrichment 
Analysis (EA), are the paradigm of such vision. The EA 
relies on the definition of gene sets or pathways as blocks 
of genes that either share a cellular role or are sequen- 
tially connected to perform a given cellular function. EA 
methods have been developed with different adaptations 
to consider specific data structures such as regulatory 
networks [5], time series measurements [6], SNP data [7] 
or multifactorial designs [8], but they all attempt to iden- 
tify gene sets whose global (de)activation is associated 
with the phenotype. Pathway databases such as KEGG, 
Reactome, BioCarta or the Gene Ontology host func- 
tional data and provide the annotation framework to 
define gene sets for enrichment analysis. 
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EA methods implicitly work under two assumptions. On 
the one hand, they consider that all genes in a gene set or 
a pathway equally contribute to the activity of that path- 
way; hence, the pathway is activated when a "sufficient" 
number of gene members is activated. This consideration 
does not take into account the differential regulatory fac- 
tors that modulate each gene's participation in the path- 
way, such as different translation rates, enzymatic and 
complex-association kinetics or the quite versatile regula- 
tory capacity of genes. An example of this last type is the 
heme biosynthesis pathway. This pathway involves eight 
enzymatic steps to transform succynil-coA and glycine 
into heme, the first being the synthesis of aminolevunilic 
acid by ALAS (aminolevunilic acid synthase), which is the 
committed step of the heme synthesis pathway and is 
usually rate-limiting for the overall pathway [9]. Hence, 
heme production is mostly controlled by ALAS regulation 
and not by a majority of pathway members. Furthermore, 
the variability in expression of human genes has been pre- 
viously evaluated by our group across thousands of micro- 
array experiments. The analysis demonstrated the 
constant expression of certain gene sets and we proposed 
a weighting scheme to account for the differential regula- 
tion capacity of genes within pathways [10]. Moreover, we 
have observed that gene regulation is associated with the 
network properties of the gene. Genes with a high cluster 
coefficient tend to show less pronounced variations at the 
transcript levels than those genes with lower connectivity 
[11] (Montaner, unpublished). All these examples illustrate 
the heterogeneous regulation capacity of genes within one 
pathway and their potentially differential contribution to 
its regulation. 

The second assumption of EA methods is that pathways 
are generally considered as isolated boxes, and the interac- 
tions between them are normally not explored. However, 
pathways should be understood as a formalization of our 
understanding of cell biology and hence their boundaries 
are arbitrary or, actually, non-existing [12]. In fact, inter- 
connections between genes and proteins go beyond path- 
way definitions and are condition dependent. Formal 
pathways may interact through either shared components 
(for example, purine and pyrimidine biosynthetic pathways 
share around 40 genes) or regulatory mechanisms (a path- 
way output might regulate or interact with proteins in a 
second pathway). Moreover, pathways may be connected 
by interaction elements that have not been discovered yet; 
for example, through regulation by non-coding RNAs 
such as miRNAs [13,14]. 

There are some recent examples in the literature of 
methodologies that analyze pathway interactions to 
understand the cross-talk between the functional blocks 
of a cellular system. Tools like ClueGO [15] and Enrich- 
mentMap [16] display pathway connections by analyzing 
the overlapping between their annotated genes. Li and 



Agarwal [17] constructed a Pathway Consensus Network 
(PCN) from the physical interactions between proteins 
belonging to different pathways and used this global 
pathway interactome to map cancer genes in order to 
understand the progression of this disease. Huang and 
Li [18] extended this concept by incorporating gene 
expression data to define active protein-protein interac- 
tions and to identify phenotype-specific sub-pathway 
networks. More recently, Kelder et al [19] obtained 
additional links between pathways by searching for con- 
necting paths that include not-yet-annotated proteins. 
Dutta et al [20] used the connectivity information in 
canonical pathway descriptions to identify study-relevant 
pathways and to characterize dependencies and connec- 
tions among pathways using gene expression data. Liu 
et al [21] construct a pathway interaction network based 
protein-protein interactions and cellular pathways, 
which is applied to the identification of deregulated 
pathways as subnetworks using gene expression data. 

In general, these methodologies rely on the selection of 
differentially expressed genes, enriched pathways and 
protein-protein interactions. Therefore, when two path- 
ways have few described protein interaction links, but 
still functionally influence each other, their connection 
might be missed by these methods [17]. In this work, we 
present a novel approach to infer pathway interaction 
networks from gene expression data that relies on a new 
concept for understanding pathway activity and relation- 
ships. This approach considers the activation pathway as 
a coordinated and relevant change in the expression 
levels of some of their genes over a number of samples. 
Unlike EA methods, it does not explicitly require a 
majority of pathway genes being activated, but that some 
covariant expression profiles can be identified. The 
method defines a pathway level gene expression signa- 
ture, or profile, that globally represents the main tran- 
scriptional regulation patterns within the pathway. Once 
pathway profiles have been defined, these are used to 
find connections between pathways. In this view, pathway 
links do not either depend on previous knowledge about 
protein-protein interactions or focus on identifying the 
genes shared between pathways, as it is the case with cur- 
rent pathway interaction approaches, but depend solely 
on pathway expression profiles. 

In previous work, we used dimension reduction tech- 
niques to obtain pathway expression profiles, which are 
associated with a physiological outcome [22]. This 
reduction strategy has also been used in other scenarios 
to obtain pathway activity indexes linked to the toxico- 
logical properties of chemical compounds [23]. Pathway 
connections were obtained by a machine-learning 
method, which has been previously applied to identify 
gene networks [24]. We use well-studied data in the 
yeast cell cycle to demonstrate our methodology, discuss 
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some relevant network links and provide guidelines to 
help interpreting the results. Finally we include an 
example of an Alzheimer gene expression dataset to 
illustrate how our method can be effective in revealing 
differential pathway connections associated to disease. 

Results 

The pathway network approach is numerically and 
biologically consistent 

Our PAthway Network Analysis approach (PANA) con- 
sists of two basic steps (Figure 1). First, transcriptomics 
data is mapped to a pathway database to generate a set of 
gene expression submatrices, one per pathway, contain- 
ing the expression values of the genes annotated to each 
pathway. Principal component analysis (PCA) is then 
applied to each submatrix to compress pathway informa- 
tion into a reduced number of expression profiles that 
characterize the pathways' gene expression changes. 
Numerically, these pathway profiles (PPs) are the scores 
of the principal components (PC) of the PCA, which are 
selected on the basis of a significance threshold alpha. 
The second step consists of obtaining a set of association 
rules that establish pair-wise connections between PPs. 
Direct and opposite rules are extracted, representing 
positive and negative correlation, respectively, between 



PPs. The quality value of the association rule is deter- 
mined by its accuracy, which is basically a measure of the 
predictive performance of a rule in terms of the sensitiv- 
ity and specificity metrics, commonly used in machine 
learning [25]. 

The performance of PANA approach was assessed 
both in terms of the formal characteristics of the 
inferred networks and in terms of functional consistency 
by comparing results obtained for the yeast cell cycle 
data against a database of yeast functional data. 

Evaluation of PANA network properties 
Simulated datasets 

We used a simulated dataset to evaluate how different 
pathway factors, namely the number of genes in the path- 
way, the type of pathway profile and the percentage of 
pathway inner correlation (defined as the percentage of 
genes in the pathway that follow the main pathway pro- 
file) would affect the network results. The simulated 
dataset contained 24,990 genes and 36 samples. Pathways 
were defined as blocks of genes of different size. Each 
pathway was assigned a different simulated expression 
profile (SEP) out of seven possibilities (Additional file 1, 
Table SI) and also each pathway contained a different 
percentage of correlated genes. Table 1 shows the 
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Figure 1 Representation of the PANA algorithm. Given a database of gene annotations into pathways and a gene expression experiment, an 
expression submatrix is generated per pathway by collecting the expression profiles of the genes annotated to each pathway. By applying PCA 
and bootstrapping to the expression submatrices, the Pathway Level Matrix is computed where row represent Pathway Profile. Next, a list of 
association rules between the pathway profiles is inferred by using GRNCOP. Finally, the pathway network is generated by selecting the rules 
with a high level of accuracy. 
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Table 1 Experimental factors and levels used to design 
the simulated datasets. 



Factors 




Levels 




Number of genes 1 0 60 


100 


140 


200 


Type of profile 1 2 


3 


4 


5 6 7 


Inner correlation percentage 0 0 2 


0.4 


0.5 


0.6 0.7 0.8 



experimental factors and levels used to design the simu- 
lated datasets. In total, 245 different pathway designs 
were obtained by the combination of the three experi- 
mental factors and expression submatrices were gener- 
ated for each of them using a multivariate normal 
distribution. The coefficient s used for the definition of 
the sigma parameter (the covariance matrix of this distri- 
bution) represents the level of noise in the generated 
data. The value of s was fixed in 0.01 for this experiment. 
We evaluated the performance of the PANA algorithm as 
a function its control parameters alpha, which modulates 
the extraction of PPs, and accuracy, that controls the 
identification of pathway links, and of the noise in the 
dataset defined by s (Figure 2). 

First, PANA results were obtained on the synthetic data 
with a fixed s (0.01) value and varying levels of alpha and 
accuracy. Figure 2A shows the correlation value between 
PPs in extracted direct rules as a function of the PANA 
control parameters (left axis) and the percentage of rules 
that have pathways sharing the same SEP (right axis). As 
expected, increasing accuracy and decreasing alpha 
values associated with selected rules involving pathways 
with increasing correlation. For example, at an accuracy 
level of 0.9 and alpha of 0.05, the average correlation 
value in rules was 0.95 and 76% of the links involved 
pathways with the same SEP. In a second simulation 



experiment we evaluated the robustness of PANA to dif- 
ferent noise levels. Figure 2B indicates that PANA finds 
relative constant number of direct rules at different noise 
levels (left axis), with slight decrease with s > = 0.1, and 
also a constant percentage of direct rules involving path- 
ways with the same SEP (right axis). Similar observations 
were obtained when considering opposite rules (Addi- 
tional file 1, Figure SI). From these results we concluded 
that control parameters on PANA correctly capture the 
correlation structure within the dataset and that the algo- 
rithm is robust to different levels of noise in the data. 
The yeast cell cycle network obtained by PANA 
The yeast cell cycle gene expression dataset used in the 
first experiment was published by Spellman et al [26]. 
This dataset contains microarray gene expression mea- 
surements at 24 time points of the yeast cell cycle syn- 
chronized by cdclS. We used this dataset to validate our 
methodology since it describes a well-known cellular sys- 
tem and there is extensive functional information avail- 
able on yeast genes. The KEGG database was used as a 
pathway annotation scheme, and 112 pathways associated 
with the yeast genes were found. 

Similarly to the results with the synthetic data, the 
number of pathway links inferred by PANA decreased 
with more restrictive alpha values and higher accuracy 
thresholds (Additional file 1, Table S2). Next, we rea- 
soned that if the PANA methodology truly captures the 
functional links between pathways, the algorithm para- 
meters will also control the biological consistency of the 
generated network. In order to evaluate the functional 
coherency of the different pathway network sets, the 
functional annotation data contained in the YeastNet2 
database [27] were employed. This database contains 
102,803 functional associations among 5,483 yeast genes. 




0.70 0.75 0.80 0.85 0.90 0.95 0.01 0.05 0.1 0.15 0.2 



Accuracy Noise level 

Figure 2 Performance of PANA on synthetic data. A) Performance of control parameters. Average correlation among network pathway 
profiles (left axis, red polygon) and percentage of links with same simulated expression profiles (right axis, blue polygon) as a function of the 
accuracy threshold. The upper and lower border of polygons indicate the range of variation at different alpha values. Noise level was set at 0.01. 
B) Robustness of PANA to experimental noise. Number of selected rules (left axis, red) and percentage of links with same simulated expression 
profiles (right axis, blue) as a function of the noise in the data. Alpha was set to 0.01 and accuracy to 0.9. 
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Each gene pair-wise relationship has an association score 
(AS), which integrates the degree of evidence obtained 
from the different types of data sources (gene co-citation 
in text mining, protein-based functional linkages, micro- 
array expression correlations, and so on) in a normalized 
value. In our work, network validation was performed 
according to two AS metrics: the integrated AS, or the 
Bayesian AS {bAS) that uses a Bayesian method to inte- 
grate functional evidences; the AS obtained exclusively 
from microarray data, denoted here as the microarray AS 
(mAS). From these AS metrics, and given any pair of 
pathways i and /, the functional association strength 
among these pathways (&ASp,y or wASp,y) was calculated 
in terms of the bAS and mAS of the pathway genes. A 
more detailed explanation about the YeastNet2 data and 
the method used for computing the bASp and wASp are 
provided in the Methods section. 

Figure 3 shows the relationship between the mean bASp 
and mASp values of the inferred networks, denoted as 
bASn and wASn respectively, and their alpha and accu- 
racy values. The plot reveals that as the alpha value 
decreases, the bASn of the identified pathway associations 
increases; i.e., the functional support of the inferred net- 
work is higher (Figure 3A). Regarding the accuracy para- 
meter, the bASn increases from 0.70 to 0.90, where the 
maximum value is reached. The fact that bASn decreases 
in the highest accuracy range is a consequence of the 
reduced network size at these levels. When accuracy 
changes from 0.90 to 0.95, a few highly connected path- 
ways drop, which has a major impact on the bASn of the 
already sparse network. When the wASn is considered, 
the relationship with the PANA parameters is similar, but 
the maximum value is reached at 0.85 (Figure 3B). The 
absolute values for mASn are lower as this index exclu- 
sively uses evidence from the co-expression data. 

Taken together, these results indicate that the two 
main control parameters of the PANA algorithm (the 



alpha value for the component selection in the first step 
and the accuracy value for rule inference in the second 
step) work consistently and are efficient in deriving 
pathway networks that are coherent with available biolo- 
gical information. From the obtained results, we selected 
the yeast cell cycle pathway network obtained with an 
alpha value of 0.001 and an accuracy value of 0.90 for 
further analysis. This network has been chosen because 
it has the highest bASn value (92.37) and 252 associa- 
tions. Therefore from this point onwards, any mention 
of the yeast cell cycle PANA refers to this particular 
network. 

Functional significance of pathway links obtained by PANA 

The yeast cell cycle PANA (YCCPN) is presented in 
Figure 4. Additional information about YCCPN links, 
such as bASp and wiASp values, the number of genes of 
the linked pathways, driving genes and other relevant 
information is included in the website of the PANA pro- 
ject at http://pathwaynetworkanalysis.org (see YCCPN 
website section). Several tests were designed to determine 
the relevance of the pathways associations inferred by the 
method. First, we asked whether the pathway links within 
the YCCPN provided greater functional evidence than 
expected by chance. For this purpose, the universal set of 
all possible yeast pathways associations and their bASp 
values were computed from YeastNet2, resulting in 2,541 
possible associations with a bASp higher than zero. In 
Table 2, the distribution of the bASp values in the uni- 
versal set and the YCCPN are presented in terms of 
twelve different percentile values. From this table, it can 
be concluded that approximately 70% of the rules 
included in the YCCPN correspond to the first quartile 
of the universal set of pathway associations. Moreover, 
40% of the YCCPN rules correspond to the 90% percen- 
tile of the universal set. Similar results were obtained 
when the network was compared to the bASp values 
obtained from randomly generated gene blocks of the 
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Figure 3 Relationship between alpha and accuracy PANA parameters and the mean association strength of the resulting network, 
considering either the Bayesian bASn (A) or Microarray mASn (B) index, computed for the yeast cdc15 dataset 
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Figure 4 The pathway network obtained from the yeast cdc15 dataset using an alpha value of 0.001 and an accuracy cutoff value 
equal to 0.90. The pathway nodes filled with the same color belong to the same KEGG pathway category. The bASp values are shown on the 
edges. Blue and red edges represent direct and opposite correlations, respectively. The thick links denote the pathway associations analyzed in 
detail. This network can be interactively browsed at http://pathwaynetworkanalysis.org. 



yeast genome (Additional file 1, Figure S2), indicating 
that the association strength of the rules in the YCCPN 
were significantly higher than what it would be obtained 
by random pairing of pathways. 

Subsequently, the reference ASp value distribution 
analysis was repeated using the raAS values; i.e., the uni- 
versal set of pathway associations integrated exclusively 
of the links supported by microarray data, resulting in 
1,350 potential associations among pathways. When the 
mASp distribution was compared to the YCCPN wASp, 
we observed that approximately 30% of the links with 
the highest wASp values included in the network corre- 
sponded to the 80% percentile of the universal set of 



pathway associations. Moreover, we have evaluated the 
number of rules with gene commonalities in the 
YCCPN and found that 76.19% of the associations cor- 
responded to pathways that had no genes in common, 
indicating that the presence of shared genes between 
pathways is not the underlying mechanism of the 
inferred networks. 

The conclusions of these analyses are two-fold. On the 
one hand, we demonstrate that the YCCPN contains a 
significant enrichment of pathway associations of strong 
functional links according to available biological knowl- 
edge. On the other hand, our approach, even when using 
only gene expression data, is able to capture relationships 



Table 2 Comparison of the percentile breakdown of the bASp and mASp distributions for the universal set of 
pathways rules and YCCPN. 
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between pathways that are evidenced by other types of 
functional information and cannot be found by univariate 
co-expression analyses. This is supported by the fact that 
this enrichment is higher when bAS -collecting multiple 
sources of functional evidence- is considered instead of 
wAS (only microarray data). This claim is also consistent 
with the fact that the identified associations go beyond 
the presence of shared genes between pathways. 

Biological relevance of PANA results 

We have demonstrated that the PANA approach unra- 
vels a network of connections between pathways, and 
that it is backed up by functional data. The next question 
is how these links can be interpreted in terms of their 
biological meaning. Our approach to this is the detailed 
analysis of the molecular function of the pathway driving 
genes. Driving genes are those genes that contribute the 
most to the definition of the pathway signature can be 
understood as fundamental pieces in their regulation (see 
Methods). We hypothesized that these genes can reveal 
the functional relationships between pathways and aid in 
the interpretation of the pathway network links. To help 
with this discussion, we refer readers to the PANA site 
where a fully hyperlinked YCCPN can be browsed. For 
notation purposes, driving gene names have been under- 
lined in this section. 
Cell cycle and DNA replication pathways 
These two pathways are strongly associated (accuracy: 
91.66%, &ASp: 304.97) and conform a cluster to other 
three DNA repair pathways. The connections between 
these processes are well documented by the literature and 
represent a suitable example to demonstrate the molecular 
fundaments of the pathway links. They share six genes 
corresponding to the MCM complex. However, none of 
those were selected as driving genes. Instead CLB6, 
CDC45, MCD1 (sscl) and RADS3 included in Cell Cycle, 
and POL30 and RFA1 annotated to the DNA Replication 
pathway, were identified by minAS as major contributors 
to the connected pathway profiles. Note that RAD53 is 
annotated as DNA replication and DNA repair by other 
databases such as Saccharomyces Genome Database. 

The regulation in eukaryotic cell cycle occurs during 
the transitions from the Gl to the S phase and from the 
G2 to the M phase [28]. These regulatory transitions 
strongly synchronize the cell cycle to DNA replication by 
means of several check-point proteins. Several of these 
proteins belong to the driving gene set of the cell cycle 
pathway, which explains the high bASp obtained for this 
pathway link. For example, CLB6 stabilizes the S phase 
by promoting DNA replication while inhibiting other cell 
cycle activities. CDC45 is an essential protein for the 
initiation of DNA replication [29]. MCD1 is present dur- 
ing DNA replication and participates in the establish- 
ment of sister chromatid cohesion [30]. MCD1 is also 



required throughout the G2 and M phases to maintain 
cohesion [31]. Finally, RADS3 encodes a kinase, which is 
activated during DNA replication when DNA damage is 
detected. This kinase slows down the replication rate to 
promote DNA repair processes [32]. POL30 is the Prolif- 
erating Cell Nuclear Antigen (PCNA), a protein that acts 
as a processivity factor for DNA polymerase 5 in eukar- 
yotic cells. In response to DNA damage, this protein is 
ubiquitinated and involved in the 7?AD6-dependent DNA 
repair pathway. 

Recent studies have identified strong correlations 
between genes POL30 and MCD1 [33]. In particular, they 
examined the expression of four genes (MCD1, POL30, 
CLB2, and SUR7), whose periodic expression during the 
yeast cell-division cycle is well established. From these 
experiments, the POL30-MCD1 pair achieved a higher 
level of correlation for synchronized (0.86) and unsynchro- 
nized (0.75) samples, proving that the simultaneous 
expression of both genes is an intrinsic feature of yeast 
growth. Therefore, there is clear evidence for strong tem- 
poral pattern matching between these driving genes. RFA1 
is a subunit of Replication Protein A Complex (RPA). 
There is evidence for the regulatory action of RAD53 on 
RPA during the early S phase [34], and also on other pro- 
teins involved during DNA replication initiation. There- 
fore, all these genes play an important role in the 
synchronization between the cell cycle and the DNA repli- 
cation process; hence their expression profiles are also clo- 
sely matched. This is illustrated in Figure 5A, which 
depicts gene expression data for the driving genes, 
together with the pathway profiles of the Cell Cycle and 
DNA Replication pathways. Both pathway profiles are 
strongly correlated, like their driving genes, all of which 
show maximum activity in the Gl and S phases. 
Glycolysis/gluconeogenesis and oxidative phosphorylation 
pathways 

This association is an example of a negative relationship 
between two pathways; i.e., basically, their pathway pro- 
files are negatively correlated. However, the link has a 
high bASp value (81.47) and an accuracy of 100%. No 
genes are shared by the two pathways. 

Eukaryotic cells produce energy in the form of ATP 
molecules by two different pathways: via glycolysis and 
by oxidation of glucose to ethanol or lactic acid. In parti- 
cular, Cytochrome c-oxidase (COX), the terminal enzyme 
of the mitochondrial respiratory chain (MRC), plays a key 
role by regulating the rate-limiting step of respiration. 
This regulation mechanism facilitates aerobic ATP pro- 
duction. An analysis of the driving genes of the oxidative 
phosphorylation pathway signature, COX12, COX13, 
COX17, CYT1, VMA4 and QCR6, reveals the important 
roles of the genes associated with the cytochrome c-oxi- 
dase complex and ATP production in the conformation 
of the temporal profile of this pathway, indicating that 
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Figure 5 Expression profiles of driving genes and pathways. A) PPs for Cell Cycle and DNA replication pathways and expression profiles of 
CLB6, CDC45, MCD1 and RAD53 (Cell Cycle driving genes), POL30 and RFA1 (DNA Replication driving genes). All the genes are overexpressed in 
the G1 and the S phases simultaneously with DNA replication, and their activation patterns are consistent with their pathway temporal profiles. 
B) Gene expression for driving genes of the Ubiquitin-mediated Proteolysis (SAF1) and Purine Metabolism (AAH1, CDC34, SKP1) pathways. The 
SAF1 gene shows an inverse correlation to the expression of the AAH1, CDC34 and SKP1 genes. 



COX activity and ATP synthesis are essential for the 
interpretation of this signature. 

On the other hand, it is well known that the glycolytic 
flux is conditionally correlated with the ATP concentra- 
tion in yeast. In particular, there is a strong negative 
correlation between glycolytic flux and intracellular ATP 
content; i.e., the higher the ATP content, the lower the 
glycolysis rate. Moreover, glycolytic enzymes HXK2 and 
ENOl drastically reduce with an increasing flux [35]. 
When considering the glycolysis/gluconeogenesis signa- 
ture pathway, the negative loading values of enzymes 
HXK2 and ENOl and the identification of several glu- 
cose-repressed proteins as driving genes, such as ACS1, 
GAL10 and ALD3, suggest that this signature represents 
a situation where the glycolysis pathway is active, but 
reaching ATP saturation. This ATP concentration is 
promoted, in part, by the activity of the Oxidative Phos- 
phorylation pathway, as indicated for the driving genes 
associated with COX and ATP syntheses. In other 
words, the negative link between Glycolysis and Oxida- 
tive Phosphorylation reflects the opposite involvement 
of ATP concentration in both pathways: while ATP pro- 
duction reflects the activity of the Oxidative Phospohor- 
ylation pathway through the action of its driving genes, 
high ATP levels down-regulate Glycolysis and modulate 
the expression of glucose-related genes. Note that this 
link is not explained by the presence of common genes 
between the pathways or protein-protein interactions, 
but by the action of a metabolic regulatory element. 



Ubiquitin-mediated proteolysis and purine metabolism 
pathways 

The previous two examples represent connections 
between pathways that are well established by the scienti- 
fic literature. Here, we discuss a pathway association that 
might not be so evident. The link between ubiquitin- 
mediated proteolysis and the purine metabolism has a 
bASp of 96.77 and an accuracy of 100%. Moreover in this 
case, no genes are shared by the two pathways. 

Several driving genes were identified for these linked 
pathways. Specifically, we focus on the interaction and 
activation patterns of SKP1 and CDC34 from the Ubiqui- 
tin-mediated Proteolysis, and on AAH1, POL32, RPA34 
and RPC19 from the purine metabolism pathway. SKP1 
is an evolutionarily conserved kinetochore protein that 
forms part of multiple protein complexes, including the 
SCF ubiquitin ligase complex. CDC34 is an ubiquitin- 
conjugating enzyme (E2) and a catalytic subunit of the 
SCF ubiquitin-protein ligase complex (together with 
SKP1, RBX1, CDC53, and an F-box protein), which regu- 
lates cell cycle progression by targeting key substrates for 
degradation. AAH1 is the adenine deaminase-encoding 
gene and plays a central role in the salvage adenine path- 
way. There is a well-known relationship between the 
driving genes of the ubiquitin-mediated proteolysis and 
AAH1, which occurs during cell starvation. In response 
to nutrient limitation, S.cerevisiae cells enter a non-pro- 
liferating state termed quiescence. AAH1 is among the 
most tightly regulated genes upon entry into quiescence. 
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Escusa et al [36] showed that AAH1 regulation at this 
stage is conducted by the gene SAF1, but its regulatory 
role is dependent on genes SKP1 and CDC34. Figure 5B 
presents the expression pattern of both these driving 
genes and SAF1, and reveals the negative correlation 
between SAF1 and AAH1, which is consistent with the 
negative regulatory role of the former on the latter. We 
postulate that this regulatory role of the ubiquitin- 
mediated proteolysis genes on key Purine Metabolism 
gene AAH1 might not be restricted to the quiescence 
process, but may operate during cell cycle progression, 
thus explaining the link between these two pathways. 

The other driving genes of the Purine Metabolism are 
related with the synthesis, repair and degradation of 
DNA and RNA. In particular, POL32 codifies a third sub- 
unit of DNA polymerase 8, involved in chromosomal 
DNA replication and required for error-prone DNA 
synthesis in the presence of DNA damage. An association 
between POL32 and the ubiquitin system occurs during 
DNA replication and repair processes, where a small ubi- 
quitin-related modifier (SUMO) and ubiquitin jointly 
affect a key signal integrator at the replication fork, 
PCNA [37,38]. Papouli et al [39] found that SUMO and 
ubiquitin cooperatively control the choice of pathway for 
the processing of DNA lesions during replication. This 
interaction is mediated by the recruitment of helicase 
SRS2 in order to inhibit DNA recombination; in particu- 
lar, Pfander et al [40] presented evidence that POL32 
SUMOylation is essential for the recruitment of this heli- 
case. This result is concordant with other works [41,42], 
which suggested that the SUMO modification of yeast 
PCNA increases the activity of translesion DNA polymer- 
ase and inhibits a recombination-dependent bypass 
mechanism. Therefore, the overexpression of POL32 is 
consistent with a simultaneous activity of the ubiquitin- 
mediated proteolysis pathway. 

PANA to unravel differential pathway connections in 
disease 

The yeast cell cycle analysis showed how PANA can 
describe pathway interconnections along a time course of 
events. In this section we evaluated how effective the 
PANA approach would be in studying differential path- 
way connectivity associated to a disease. For this, we 
used two microarray datasets generated for the study of 
Alzheimer Disease (AD). Both datasets were downloaded 
from the Gene Expression Omnibus (GEO) database 
http://www.ncbi.nlm.nih.gov/geo/ and were previously 
used by Dutta el al [20] for the detection of pathways 
associated with AD. 

The first dataset (GEO ID: GDS810) [43], studied the 
expression profile of genes from the hippocampal region 
of the brain as a function of the progression of the disease 
(incipient, moderate, and severe). The second dataset [44] 



explored the effect of AD in six different brain regions: the 
entorhinal cortex, hippocampal field CA1, middle tem- 
poral gyrus, posterior cingulate cortex, superior frontal 
gyrus, and primary visual cortex (GEO ID: GSE5281). 
Since different regions of the brain are involved in control- 
ling different biological processes, this dataset can provide 
insights into the tissue-specific activation of pathways. The 
entorhinal cortex region samples were obtained from 
patients in the early stages of AD, while the remaining 
samples were obtained from patients in the later stages of 
the disease. Dutta et al [20] specifically analyzed pathways 
that have statistically significant association with the AD 
pathway (KEGG hsa05010). The analysis focused on six 
conditions (moderate and severe samples in the disease 
progression dataset; and primary visual cortex, hippocam- 
pal field CA1, middle temporal gyrus, and posterior cingu- 
late cortex regions in the brain regions dataset), where the 
AD pathway was found statistically enriched. Those path- 
ways associated to the AD are at least 3 conditions were 
selected as relevant associations. 

We have applied PANA to the same set of conditions. 
For the six cases, the pathway network corresponding to 
the disease and control samples were computed and con- 
trasted, and we asked which pathways most frequently 
modify their association with the AD pathway when 
switching from healthy to disease status. The analysis 
revealed that PANA detects most frequent associations 
reported by Dutta, but in some cases with a lower fre- 
quency (Table 3). Notably, there were also new recurrent 
associations inferred only by our method. All new asso- 
ciations obtained by PANA that occur at least in three 
experiments are listed in Table 4. The last two columns 
contain the number of genes shared with the AD path- 
way and literature references that support the new asso- 
ciations [45-59]. 

A new rule is related with the Focal Adhesion pathway. 
Alzheimer's disease is a neurodegenerative disorder that 
results from a loss of synaptic transmission and ulti- 
mately cell death. The presenting pathology of AD 
includes neuritic plaques composed of beta-amyloid pep- 
tide (AP) and neurofibrillary tangles composed of hyper- 
phosphorylated tau, with neuronal loss in specific brain 
regions. In the other hand, focal adhesion proteins 
assemble into intracellular complexes involved in integ- 
rin-mediated communication between the extracellular 
matrix and the actin cytoskeleton, regulating many cell 
physiological processes including the cell cycle. Remark- 
ably, recent studies report that integrins bind to Ap 
fibrils, mediating Ap signal transmission from extracellu- 
lar sites of Ap deposits into the cell and ultimately to the 
nucleus. In particular, Caltagarone et al [54] discuss how 
the Ap induced integrin/Focal Adhesion signaling path- 
ways mediate in cell cycle activation and cell death dur- 
ing AD progression. Other novel association occurs with 
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Table 3 List of pathways more frequently associated with the AD pathway reported by Dutta et al [20]. 



Pathway KEGG id Shared Genes with AD pathway Detected by PANA? 
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Adherens junction 


hsa04520 


2 


YES 


Axon guidance 


hsa04360 


11 


YES 


Do rso- ventral 


hsa04320 


2 


NO 


Insulin signaling 


hsa04910 


10 


YES 


Long-term depression 


hsa04730 


11 


YES 


Long-term potentiation 


hsa04720 


29 


YES 


Neurotrophin signaling 


hsa04722 


12 


YES 


Oocyte meiosis 


hsa041 14 


18 


YES 


Pathways in cancer 


hsa05200 


11 


YES 


Ubiquitin mediated proteolysis 


hsa04120 


0 


NO 



For comparison purpose, the third column shows the shared genes with the AD pathway and the fourth column indicates which associations are also detected 
by PANA. 



the Peroxisome pathway. In Alzheimer's disease lipid 
alterations are present early during disease progression. 
Some of these alterations point towards a peroxisomal 
dysfunction. Peroxisomes are present in all nucleated 
human cells, including all cell types of the brain, and per- 
form anabolic and catabolic functions and play a major 
role in generation and decomposition of plasmalogens 
and docosahexaenoic acid. The levels of both of these 
lipids are decreased in brains of patients suffering from a 
generalized peroxisome biogenesis deficiency (Zellweger 
syndrome spectrum) and in AD. In particular, Kou et al 
[50] observed that the decrease in plasmalogens and the 
increase in VLCFA (very long-chain fatty acids) and per- 
oxisomal volume density in neuronal somata showed a 
stronger association with neurofibrillary tangles than 
with neuritic plaques. Therefore, these results indicate 
substantial peroxisome-related alterations in AD, which 
may contribute to the progression of AD pathology. 
Another example is the link with the VEGF (vascular 
endothelial growth factor) pathway. VEGF, a critical 
mediator of angiogenesis, is present in the AD brain in 
the walls of intra-parenchymal vessels, in diffuse perivas- 
cular deposits, and in clusters of reactive astrocytes. In 



addition, intrathecal levels of VEGF in AD are related to 
clinical severity and intrathecal levels of amyloid-beta 
(AP). Emerging data support the idea that factors and 
processes characteristic of angiogenesis are found in the 
AD brain [52]. Rosenstein et al [53] also discuss about 
the role of VEGF in the perfusion deficits related with 
neurodegenerative disorders, such as Alzheimer and 
Huntington diseases, suggesting that problems in vascu- 
lar tone regulation contributes to the pathogenesis of 
these disorders. 

Interestingly, and as mentioned before, PANA was 
able to detect associations between pathways that only 
share few genes or even none (i.e. Peroxisome and AD 
pathways do not share genes). This contrasts with Dut- 
ta's results where the median number of shared genes 
linked to the AD pathway is 11. For PANA new rules 
this number drops to 3. This result is a direct conse- 
quence of fundamental differences between both algo- 
rithms. Dutta's method is oriented towards the 
topological information of the pathways (where the 
shared genes play a central role), whereas our methodol- 
ogy connects pathways based on their shared activity 
profiles. Still, the literature survey returns evidence of 



Table 4 Pathways associated with the AD pathway obtained exclusively by PANA method. 


Frequency 


Pathway 


KEGG id 


Shared genes with AD pathway 


Literature evidence 


4 


Citrate cycle (TCA cycle) 


hsa00020 


4 


[46] 


4 


Pyruvate Metabolism 


hsa00620 


0 


[47,48] 


3 


MAPK signaling 


hsa04010 


19 


[49,50] 


3 


Peroxisome 


hsa04146 


0 


[51,52] 


3 


VEGF signaling 


hsa04370 


11 


[53,54] 


3 


Focal adhesion 


hsa04510 


5 


[55] 


3 


Aldosterone-regulated sodium reabsorption 


hsa04960 


2 


[56,57] 


3 


Carbohydrate digestion and absorption 


hsa04973 


2 


[58,50,60] 



Only those rules that occur at least in three experiments are listed. 
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functional connections between AD and these pathways 
as discussed before. 

Discussion 

Understanding the complexity of molecular interactions in 
a cellular system is one of the most challenging aspects of 
current genomics research. Many analysis approaches 
have been developed in recent years and have attempted 
to exploit functional information and multivariate analyses 
to provide answers about molecular systems functioning. 
These approaches rely on the systems biology concept; 
hence they analyze the collective behavior of groups of 
genes. In this work, we take one step forward by present- 
ing a methodology that not only studies blocks of genes 
jointly, but also establishes relationships between these 
blocks. The result is a global, interconnected view of the 
system's transcriptional status. 

There are some substantial differences between the 
PANA approach and most functional Enrichment Analysis 
methods. Probably, the most relevant one is the way that 
PANA extracts information from the gene set (or func- 
tional block), while the EA methods typically rely on identi- 
fying a significant majority of gene set members associated 
with the phenotype and consider all the genes equally con- 
tributing to the block's functionality, PANA is built upon 
the analysis of the correlation structure within the group of 
functionally related genes (for example, by forming part of 
a same sub-pathway, as exemplified in this work). 

Both the covariation analysis and the feature extrac- 
tion algorithm imply that the functional block hosts a 
level of transcriptional variation which is above a given 
noise threshold (see Methods) and that this might be 
concentrated in a subset of pathway genes. This proce- 
dure is able to address situations where pathways are 
roughly defined, include genes that are not necessarily 
co-expressed or when the regulation of the pathway is 
concentrated in a low number of switch genes. For 
example, the KEGG Purine Metabolism pathway (PMP) 
in yeast, present in our YCCPN results, includes reac- 
tions involving purine nucleotides and it branches out 
to histidine and thiamine metabolisms, sulfur assimila- 
tion and allantoin degradation pathways, among others. 
These other sub-pathways are not particularly seen as 
being highly regulated in our analysis. Additionally, 
ADE4, the first committed step in purine biosynthesis 
by catalyzing the reaction of PRPP, water and glutamine 
to 5'phosphoribosylamine, is identified as a driving gene 
in our analysis. ADE4 overexpression, but not the activ- 
ity of other ADE genes, was found to increase purine 
biosynthesis in yeast [60]. PANA results are in agree- 
ment with these prominent regulatory roles of some 
pathway genes. 

Another differential characteristic of the PANA 
approach is that the links between pathways do not 



derive from shared components or protein-protein inter- 
action data, but exclusively from a co-transcriptional 
analysis. Co-expression has been largely used to infer 
gene regulatory networks [61-63], but these approaches 
normally ignore the participation of the genes in path- 
ways and hence are limited in providing a global func- 
tional interpretation of the results [64]. This is the case 
of popular approaches such as WGCNA [65] -targeted 
to create scale-free networks from gene co-expression 
analysis- and GeneMANIA [66], focused in the integra- 
tion of multiple gene association networks. We have 
shown that YCCPN connections are supported by func- 
tional evidence that goes beyond the gene expression 
data contained in the YeastNet2 database, and hypothe- 
sized that the pathway-centered multivariate analysis 
basis of our approach might be more robust in identify- 
ing functional transcriptional connections than pair-wise 
gene expression analyses. We have also shown in two 
examples that these transcriptional links can be 
explained by the action of molecular features that are 
not part of the connected pathways themselves. Such is 
the case of the Glycolysis/Gluconeogenesis and Oxida- 
tive Phosphorylation Pathways, which are regulated by 
ATP levels, and the Ubiquitin-mediated Proteolysis and 
Purine Metabolism Pathways, which are connected by 
the regulation of the SAF1 protein. This is an interesting 
property because it makes the approach amenable for 
application in situations of insufficient or misplaced 
pathway database annotation or when common regula- 
tory elements are not proteins. 

An unique PANA feature is the possibility of presenting 
different aspects of pathway behavior when the dimension 
reduction step results in multiple principal components 
being selected as pathway profiles. Each one can establish 
links with different pathways. For example, the Cell Cycle 
pathway is represented in the YCCPN by three profiles 
corresponding to principal component one, two and four 
of the PCA of the Cell Cycle gene expression matrix. Cell 
cycle_l collects most of the canonical controllers of cell 
cycle progression and is linked to several DNA processing 
pathways, as discussed in the Results section. However, 
Cell cycle_4 presents a profile of activation at late time 
points of the yeast experiment. This profile is connected 
to the Glycine, Serine and Threonine Metabolism path- 
ways, and indirectly to the Galactose Metabolism pathway 
(Figure 5). One of the driving genes of Cell Cycle_4 is 
PH085, which negatively controls the expression of 
numerous genes induced under nutrient limitation condi- 
tions [67]. One of these repressed genes is UGP1, which 
catalyzes the reversible formation of UDP-Glc, a source 
compound in glycogen and trehalose biosynthesis. In our 
analysis, UGP1 is the driving gene of the Galactose Meta- 
bolism pathway and is negatively correlated with PH08S. 
Moreover, two driving genes in the Glycine, Serine and 
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Threonine Metabolism pathways, GCV2 and CHA1, are 
related to nitrogen utilization under nutrient-limiting con- 
ditions. Hence the fourth profile of the Cell Cycle pathway 
might witness the coordination of this pathway with the 
nutritional state of the yeast cell. 

Another interesting application of PANA is to unravel 
changes in pathway connectivity that associate to a given 
phenotype. This is relevant not only to understand the 
new functional status acquired in a disease situation, but 
also to explore possible side effects of treatments. Meth- 
ods for differential molecular wiring have been described 
at the gene level [68,69] and have shown that differences 
in gene co-expression patterns rather than absolute 
expression level differences can determine phenotypic 
differences. In the Alzheimer dataset example we extend 
this concept to the pathway level and show that, by com- 
paring the pathway network of healthy versus diseased 
individuals we can spot pathway connections that consis- 
tently change in Alzheimer patients. Some of these new 
connections can be detected by methods based on shared 
protein components but many other relevant ones were 
only found by our methodology. 

PANA was developed in the microarray analysis context, 
but can be extended to other high-throughput methodolo- 
gies provided that a functional database is available for fea- 
ture annotation. The adaptative association rule algorithm, 
used for network construction, recommends evaluating 
the expression along a sufficient number of samples. This 
might preclude the utilization of this approach in reduced 
sample size experiments, but does not restrict the method 
to time series data. Besides, case control studies and multi- 
factorial designs are potential experimental set-ups for 
PANA. On the other hand, the dimension reduction tech- 
nique used in the first algorithm step, PCA, analyzes cov- 
ariation across the entire data matrix. Other multivariate 
analysis approaches, such as biclustering or spectral ana- 
lyses, might extend the possibilities of the method to iden- 
tify the pathway profiles associated with a restricted 
number of samples and to fine-tune the network analysis 
to specific conditions within the experimental design. 

In summary, we propose a novel method for the inter- 
pretational analysis of high-throughput data in systems 
biology research. This approach not only offers global 
views of the interconnections among the different func- 
tional blocks of the system, but also allows focusing on 
these links to reach the molecular basis of the network. 
We believe PANA is a useful tool to improve our under- 
standing of the functional interdependencies that operate 
within complex biological systems. 

Methods 

The PANA algorithm 

The proposed approach relies on the combination of 
dimensionality reduction methods with machine-learning 



techniques. Given a gene expression experiment and an 
annotation scheme for genes in pathways or functional 
modules, this method creates a gene expression subma- 
trix for each pathway and uses a Principal Component 
Analysis (PCA) to reduce the dimensionality of the path- 
way expression data [22]. Each pathway will be repre- 
sented by one or a few pathway signatures or pathway 
profile (PP), which collect most of the gene expression 
variation within the pathway and represent the pathway 
activity changes along the experiment. These PPs are 
used as input data to derive adaptive association rules 
based on mutual information maximization [24]. These 
rules can be seen as the covariation relationship between 
PPs and can be represented in the form of a network of 
pathway interactions with direct and opposite links 
depending on the direction of the rule. Hence, the net- 
work inference methodology consists of two main phases; 
pathway compression and association rule inference, 
which are described below. 
Phase 1: Pathway compression 

Given a transcriptomics experiment, let X be the expres- 
sion data matrix of dimension N x M, where N is the 
number of genes measured and M is the total number of 
samples. Let x nm ml ... N, m:l ... M be the expression 
value of gene n in sample m. Let F be the set of functional 
annotation (pathways) of the genes in the transcriptomic 
dataset. Let Nf be the number of genes associated with 
each pathway /e F. 

1. For each fe F, create the expression submatrix of 
X, Xf, with the Nf rows corresponding to the genes asso- 
ciated with pathway /and with the same M columns as X. 
Obtain X f c as the transposed, column-mean centered 
transformation of X f . 

2. For each pathway f, obtain a number of pathway 
signatures hf, by applying a Principal Component Analy- 
sis- (PCA) based procedure that uses bootstrapping to 
obtain pathway signatures with a given confidence 
alpha according to the following procedure: 

a) Given the original expression matrix X with M 
columns, sample M columns with replacement to 
obtain X r . Use X r to calculate the variance of each 
gene. Approximate the gene variance distribution by 
a Gamma distribution as described [70] and obtain a 
gamma cutoff value as the 1-alpha percentile value 
of this distribution. 

b) Apply PCA to each bootstrap pathway submatrix 
X/ and select the principal components (PC) with 
variance (eigenvalues) higher than the gamma cutoff. 
Let PCi, PC 2 , PC k ( r) the selected PCs for matrix 
X f r , where 1 < k(r) < rank(X f r ). 

c) Repeat 3 and 4 R times. Let H f be the set of all 
the selected PCs in the R repetitions: H f = {1,..., k(l), 
1,..., k(2), .... 1,..., k(R)}. Hence, each ieH f has a 
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frequency q^. Select z'GH f with frequency q^ higher 
than a Q threshold (typically 95%) as the number hf 
of pathway signatures for pathway / 

3. Given the hf principal components selected using 
the criterion described above, the PCA decomposition 
of submatrix X f c can be written as: X f c = TfP/ + E, 
where Tf is the scores matrix for pathway/ (with dimen- 
sions M x hf), Pf is the loadings matrix for pathway / 
(with dimensions Nf x hf) and E is an error term. The 
scores matrix T f represents the hf pathway signatures 
for pathway f. These new hf functional variables repre- 
sent the coordinative expression patterns of the genes 
associated with pathway / Pf collects the contribution of 
each gene to each pathway signature. 

4. Create a pathway level matrix (PLM) through the 
row-wise concatenation of the T f scores matrices of all 
the pathways with at least one selected pathway signa- 
ture. Hence, all the pathway signatures selected during 
Step 5 are included in the PLM, which has Z/ e F hf rows 
and M columns. 

Phase 1 is depicted in Figure 6. 
Phase 2: Inference of association rules 
1, For each pathway signature, make an adaptive discreti- 
zation of PLM into two states: high and low activity levels 
of the pathway signature, represented by values 1 and -1, 
respectively. For example, a relative high PCA score for a 
sample in a pathway signature means that the group of 
genes associated with this pathway are, in general, over- 
expressed in that sample. An adaptive method based on 
the partition entropy metric, typically used in Informa- 
tion Theory, is employed for data discretization [24] . The 
discretization procedure works as follows: discretization 
of the PLM is computed per pathway signature / (row j 
of matrix PLM). The discretized matrix obtained for the 



PLM and pathway signature j (PLMj) is denoted as 
SPLMj. For the computation of 5PLMj a set of discretiza- 
tion thresholds (fy) for each signature pathway / (PLMi), 
with / * j, is calculated in relation to PLMj. The algo- 
rithm for computing considers each score value shown 
by the pathway signature PLM! as a candidate threshold 
t c . Therefore, for each possible t c value, the sample set 
PLMi is partitioned into two subsets, namely S_i and Si. 
S^i contains all the samples where PLM ( has a score 
value that is less than or equal to t c , whereas Si contains 
all the samples where PLMi has a score value that is 
greater than t c . In other words, S_i and Si represent sam- 
ple sets (columns of the PLM) where PLMi has low and 
high activity levels, respectively, on the basis of t c . Next, 
calculate the partition entropy, which is a statistical indi- 
cator of the quality of threshold t c as a discretization 
value for PLM t in relation to the discretization of PLMj. 
The partition entropy is computed from the discretized 
values of rows PLMi and PLMj, where PLMi is discre- 
tized using t c , while PLMj is discretized using its average 
score value. In numerical terms, the value returned by 
this metric is a real number between 0 and 1. When the 
partition entropy value associated with a discretization 
approximates 0, the threshold t c that generates this dis- 
cretization represents a better solution. Consequently, 
the t c that minimizes the partition entropy is selected as 
tij. Details about the equations for the computation of the 
entropy and partition entropy metrics can be found in 
Mitchel [71] and Kohani [72], respectively. 

2. Extract the association rules from the discretized 
matrix by detecting covariation between pairs of path- 
way signatures. The rules inference procedure is applied 
to each 5PLMj in order to determine which pathway 
signatures are linked with pathway signature j (PLMj). 
This task is carried out by a classifier optimization 
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Figure 6 Detailed representation of Phase I of the PANA algorithm: generation of the Pathway Level Matrix (PLM) 
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method [24], which infers association rules and their 
accuracy values. In mathematical terms, the classifiers 
are computed as solvers of the following combinatorial 
optimization problem: 

k 

Umax erf tz\, SPLMA 

i=i 
Where: 

♦ k = number of pathway signatures in PLM (k = Z/^f hf) 

♦ P is the space of all the vectors v of dimension k, so 
that Vi represents a class of association rule Vz, i = 1 ... k, 

♦ SPLMj is the discretization of the PLM for pathway 
signature j, 

♦ jij G P is a classifier of all the rules with an incidence 
on pathway signature /, 

♦ (tT;), 8PLMj is a performance function that evaluates 
the accuracy of a classifier obtained from the SPLMj 
data, 

Therefore, per each pathway j, the inference method 
obtains the pathways linked with j by solving this opti- 
mization problem by combinatorial analysis. Rule accu- 
racy is computed for the O function in terms of the 
well-known sensitivity and specificity metrics by using 
the equation proposed by Carballo and Freitas [25]. 
Therefore, those rules with accuracy values above a pre- 
defined threshold are selected for the network construc- 
tion. The resulting network represents the relationships 
between pathway signatures and has directionality. The 
direction of a network link represents the direction in 



which the association rule holds and indicates logical 
causality. 
Phase 2 is schematized in Figure 7. 

Determination of driving genes 

In order to better interpret the associations between 
pathways, we propose to determine which genes contri- 
bute most to create the pathway profile (PP). As PPs are 
modeled by PCA, loadings represent the contribution of 
each gene to the definition of the PP. Genes with low 
loadings are poorly correlated with the pathway profile, 
while those with high loadings are highly correlated. 
Frequently, a subset of pathway genes can be identified 
as being mainly responsible for the definition of the PC. 
We will refer to these genes as driving genes as they are 
"pulling" the pathway signatures for having the greatest 
weights in the PC. 

To identify driving genes, PANA uses the minAS 
method [70], which is an algorithmic strategy to classify 
features according to the values of a statistic that mea- 
sures the importance of those features in the model. In 
our case, the model is a PCA and the statistics are the 
gene loadings. Usually, in PCA models, each PC is 
defined by a relatively low number of variables while 
most of the variables will have loadings close to zero. 
Accordingly, minAS works under the assumption that 
the distribution of the statistic is at least bimodal: it fol- 
lows a mixed distribution with at least two components. 
The first component (typically with the highest mode) is 
associated with variables with a negligible value of the 
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statistic which are normally of no interest, while the rest 
of components correspond to variables where the statis- 
tic has high values. minAS obtains a cutoff value that 
separates the first component from the rest by firstly 
estimating the density function of the statistic with a 
Kernel Density Estimator and by then computing the 
point where the first local minimum is reached. Hence 
this cutoff is not arbitrary, but consistent with the infor- 
mation contained in the data [70]. Those genes with 
absolute loading values higher than the minAS cutoff 
will be selected as the driving genes of each pathway 
profile. 

Yeast Cell Cycle dataset 

The microarray data used for the inference of yeast cell 
cycle PANA (YCCPN) were published by Spellman et al 
[26]. These expression values were obtained for S. cerevisiae 
cell cultures, which were synchronized by three different 
methods: the cdcl5, cdc28 and alpha factors. The data- 
transformation method used by Spellman et al returned 
background-corrected signal log ratios, with control as an 
average expression level extracted from asynchronous cul- 
tures of the same cells growing exponentially at the same 
temperature in the same medium. For this work, the cdcl5 
experiment was selected as the benchmarking dataset for 
the generation of YCCPN because it contains the largest 
number of data points (24 samples), thus providing the lar- 
gest number of instances for the machine-learning method. 
In this time series, cell cycle progression was blocked at a 
specific point by conditional factor CDC15 which, if 
removed, permits cells to recommence progression through 
the cell cycle in a synchronous fashion. 

Yeast Cell Cycle PANA Network validation 

To validate YCCPN, the functional annotation data con- 
tained in YeastNet2 [27] were used. Approximately 
1800000 individual experimental observations were inte- 
grated into YeastNet2 from ten different types of func- 
tional genomics, proteomics and comparative genomics 
datasets by optimizing a total of -155 free parameters 
to construct the whole network. 

YeastNet2 contains a total of 102803 links covering 5483 
yeast proteins, this which represents 95% of the validated 
yeast proteome and provides an association score (AS) for 
each pair-wise gene relationship. AS values are obtained 
from each kind of experimental evidence separately (i.e, 
gene co-citation in text mining, protein-based functional 
linkages, microarray expression correlations, etc) and 
jointly combining, following a Bayesian method, different 
evidence AS into an unique AS value per gene association. 

In our work, network validation was performed 
according these two AS metrics: the integrated AS 
value, named Bayesian AS (&AS) represents the amount 
of scientific evidence of a putative pathway association 



by taking into account all the biological data sources, 
and the AS value obtained from microarray evidence 
exclusively, named Microarray AS (wAS). From both AS 
metrics, and given a pair of pathways i and j, the func- 
tional association strength between these two pathways 
(£ASpij or wASpij) was computed as the sum of the 
functional association scores of all the gene pairs which 
can be established between these pathways. Moreover, a 
mean bAS and mAS value for each network were calcu- 
lated as the average bASp and wASp values, respec- 
tively, of its pathway associations. Although ASp is a 
summing-up value, there was no correlation between 
the magnitude of ASp and the size of the pathway 
involved (Additional file 1, Figure S3). 

The PANA project website 

The YCCPN described in this paper can be visualized 
and explored on the project web site at http://pat.hway- 
networkanalysis.org. On this web site, some fundamental 
definitions - such as a pathway signature and driving 
gene concepts - are enunciated, and the pathway network 
is depicted in a web-navigable format. The YCCPN figure 
includes zooming capabilities to improve dynamic visua- 
lization. Each network node -pathway signature- is linked 
to a gene expression painted image of the pathway 
obtained with the Paintomics tool [73], which has further 
links to the KEGG data. These images also highlight the 
driving genes by bold-lined boxes. These boxes usually 
have more than one associated gene. For this reason, a 
pop-up window is displayed for each box, where the driv- 
ing genes are denoted by blue filled squares. The network 
edges discussed in this paper are indicated by thick lines 
with hyperlinks to a text document explaining the biolo- 
gical background of the association. 

The pathway signatures of YCCPN are also available as 
a pdf file in the section Additional Information. In this 
document, three plots are included for each pathway sig- 
nature: the pathway signature profile, the loading value 
curve (with an identification of the gene with the highest 
loading value), and the temporal profile that corresponds 
to this gene. This information is useful for understanding 
the direction of the principal component which repre- 
sents the signature pathway. Finally, the annotation of 
samples at cell cycle phases can be found in the section 
Additional Information. In particular, the cdcl5 dataset 
contains 24 samples obtained during 300 minutes [74]. 

Additional material 



Additional file 1: Figure SI. Performance of control parameters for 
opposite rules. Average correlation among Pathway Profiles of opposite 
rules (left axis, red polygon) and percentage of opposite links with same 
simulated expression profile SEP (right axis, blue polygon) in the resulting 
network, as a function of the accuracy threshold. The upper and lower 
border of polygons indicate the range of variation at different alpha values. 
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Noise level was set at 0.01. Figure S2. Association score enrichment 

against random pathways. For each pathway association rule R present in 
the YCCPN, a total of 100 random pathway associations with the same 
cardinality pattern as R were generated, the ASp values computed and the 
percentile position of the ASp of R in its reference distribution was 
obtained. The cardinality pattern of a rule is defined by three values: the 
amount of genes contained in each pathway linked by the rule, and the 
number of shared genes between both pathways. This analysis revealed 
that most (63% of the links) of the rules obtained by the PANA method are 
located in the 20% percentile of the 100 random trials of their gene 
cardinality pattern. In particular, the average bASn of network integrated by 
the random links is low (28.50) in comparison with the bASn of the YCCPN 
(123.61). The difference between YCCPN bASn and the random bASn was 
statistically significant (t-test p-value < 0.05). Figure S3. Independence of 
the ASp score of the pathway size. Relationship between pathway 
association scores (bASp and mASp values) and the number of genes in the 
left (dot) and right (cross) pathways. Lack of correlation is observed in all 
cases. Table SI. Simulated expression profiles (SEP) Tempora 
expression patterns defined for the generation of simulated time series for 
the artificial pathways. Table S2. Network size (number of pathway 
associations) inferred using different accuracy and alpha values in the 
Yeast Cell Cycle network obtained by PANA 
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